Home

Column

Project Overview

We decided to analyze sports data in this project, specifically data from Formula 1 races. A brief overview of the data and a data dictionary are included in the following sections.

The later sections apply several regression and classification methods to answer research questions about the data. The methods included are:

  • Multiple Linear Regression
  • Ridge Regression
  • LOESS (local regression) fit
  • kNN classification
  • Naive Bayes Classification
  • Logistic Regression Classification

These methods were split equally between the partners. Nandini Bhelke worked on the first three regression methods, while Kevin Hallissey worked on the last three classification methods.

Citations

F1 Data and Images:

MLR References:

Ridge Regression References:

Column

Formula 1

Data Overview

Row

Data Source and Processing

The data for this project was sourced from OpenF1 (https://openf1.org/). OpenF1 is a free and open-source API that provides real-time and historical Formula 1 data. The data can be accessed in either JSON or CSV format through a web browser. We downloaded the relevant CSV files and processed them into a single, complete dataset for this project. The specific files used were:

  • Drivers : Provides information about drivers for each session.
  • Laps : Provides detailed information about individual laps.
  • Meetings : Provides information about meetings. A meeting refers to a Grand Prix or testing weekend and usually includes multiple sessions (practice, qualifying, race, …).
  • Position : Provides driver positions throughout a session, including initial placement and subsequent changes.
  • Sessions : Provides information about sessions. A session refers to a distinct period of track activity during a Grand Prix or testing weekend (practice, qualifying, sprint, race, …).

These separate CSV files were then processed and combined to form our final dataset. A data dictionary with all variables is provided to the right. The complete dataset can also be accessed at this link: https://drive.google.com/file/d/1Bvc_8Os35966WgIHAyXb80IT2ddCGT2A/view?usp=drive_link
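As a rough illustration of the combining step described above, the sketch below averages lap-level rows per driver and session and then joins the result to driver and session details. The tiny data frames are made up for the example, and the lap-level column names (`lap_duration`, `st_speed`) and the join keys are assumptions about the CSV layout rather than a record of the actual processing script.

```r
# Toy stand-ins for the CSV exports; real files would be loaded with read.csv().
laps <- data.frame(
  session_key   = c(1, 1, 1, 1, 2, 2),
  driver_number = c(44, 44, 1, 1, 44, 1),
  lap_duration  = c(92.1, 90.3, 91.0, 89.8, 95.4, 94.9),  # seconds (assumed column)
  st_speed      = c(310, 315, 312, 318, 305, 307)          # km/h (assumed column)
)
drivers <- data.frame(
  session_key   = c(1, 1, 2, 2),
  driver_number = c(44, 1, 44, 1),
  name_acronym  = c("HAM", "VER", "HAM", "VER")
)
sessions <- data.frame(
  session_key  = c(1, 2),
  session_name = c("Practice 1", "Race")
)

# Average each driver's laps within a session.
lap_avgs <- aggregate(cbind(lap_duration, st_speed) ~ session_key + driver_number,
                      data = laps, FUN = mean)
names(lap_avgs)[3:4] <- c("avg_duration", "avg_st_speed")

# Join the per-driver averages to the driver and session details.
f1 <- merge(lap_avgs, drivers, by = c("session_key", "driver_number"))
f1 <- merge(f1, sessions, by = "session_key")
print(f1)
```

Each row of `f1` then describes one driver in one session, which matches the granularity of the data dictionary.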

Row

Data dictionary

Variable Name Description
year The year the event takes place
meeting_key The unique identifier for the meeting
meeting_name The name of the meeting
meeting_official_name The official name of the meeting
session_key The unique identifier for the session
session_name The name of the session (Practice 1, Qualifying, Race, …)
country_key The unique identifier for the country where the event takes place
country_name The full name of the country where the event takes place
driver_number The unique number assigned to an F1 driver
first_name The driver’s first name
last_name The driver’s last name
name_acronym Three-letter acronym of the driver’s name
team_name Name of the driver’s team
avg_duration The average of the total time taken, in seconds, to complete the entire lap
avg_duration_sec1 The average of the time taken, in seconds, to complete the first sector of the lap
avg_duration_sec2 The average of the time taken, in seconds, to complete the second sector of the lap
avg_duration_sec3 The average of the time taken, in seconds, to complete the third sector of the lap
avg_i1_speed The average of the speed of the car, in km/h, at the first intermediate point on the track
avg_i2_speed The average of the speed of the car, in km/h, at the second intermediate point on the track
avg_st_speed The average speed of the car, in km/h, at the speed trap, which is a specific point on the track where the highest speeds are usually recorded
avg_duration_start The average of the total time taken, in seconds, to complete the entire lap for only the first 5 laps
avg_duration_sec1_start The average of the time taken, in seconds, to complete the first sector of the lap for only the first 5 laps
avg_duration_sec2_start The average of the time taken, in seconds, to complete the second sector of the lap for only the first 5 laps
avg_duration_sec3_start The average of the time taken, in seconds, to complete the third sector of the lap for only the first 5 laps
avg_i1_speed_start The average of the speed of the car, in km/h, at the first intermediate point on the track for only the first 5 laps
avg_i2_speed_start The average of the speed of the car, in km/h, at the second intermediate point on the track for only the first 5 laps
avg_st_speed_start The average speed of the car, in km/h, at the speed trap for only the first 5 laps; the speed trap is a specific point on the track where the highest speeds are usually recorded
max_speed The max speed of the car, in km/h, from speeds recorded at the speed trap
position Final position of the driver (starts at 1)

Multiple Regression

Column

Multiple Linear Regression Overview

Research Question: Can we predict the average speed trap speed over all laps in a Grand Prix Race using the speed trap speeds for the practice and qualifying laps?

For the Multiple Linear Regression (MLR), we focused on predicting race speeds from practice and qualifying speeds. Specifically, we used the speeds at the speed trap, a specific point on the track where the highest speeds are usually recorded. First, we split the dataset into practice/qualifying sessions and race sessions and computed the average speed trap speed for each. We then matched the practice and qualifying averages with the race averages to see whether the former could be used to predict the latter. Our original model equation was as follows:

\[ \textbf{Race Speed} = \beta_0 + \beta_1 \text{P}_1 + \beta_2 \text{P}_2 + \beta_3 \text{P}_3 + \beta_4 \text{Q}\]

where \(\text{P}_1\), \(\text{P}_2\), and \(\text{P}_3\) are the average speed trap speeds over all laps in Practice 1, Practice 2, and Practice 3, and \(\text{Q}\) is the average over all Qualifying laps.

After model selection using stepAIC(), our final model was:

\[\begin{align} \textbf{Race Speed} &= \beta_0 + \beta_1 \text{P}_2 + \beta_2 \text{P}_3 \\ &= 59.0857 + 0.6389 \text{P}_2 + 0.1901 \text{P}_3 \end{align}\]
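The selection step can be sketched as follows. This is a minimal example on simulated data, not the project's dataset: the predictor names `p1_speed` and `q_speed` are assumptions, and base R's `step()` is used here as a stand-in for `MASS::stepAIC()`, since both perform AIC-based stepwise selection.

```r
set.seed(1)
n <- 400

# Simulated session averages in km/h; these are NOT the project's data.
mlr_df <- data.frame(
  p1_speed = rnorm(n, 290, 20),
  p2_speed = rnorm(n, 290, 20),
  p3_speed = rnorm(n, 290, 20),
  q_speed  = rnorm(n, 295, 20)
)
# Only Practice 2 and Practice 3 carry signal in this simulation.
mlr_df$race_speed <- 60 + 0.6 * mlr_df$p2_speed + 0.2 * mlr_df$p3_speed +
  rnorm(n, 0, 25)

# Fit the full model, then let AIC decide which terms to keep.
full_model  <- lm(race_speed ~ p1_speed + p2_speed + p3_speed + q_speed,
                  data = mlr_df)
final_model <- step(full_model, direction = "both", trace = 0)

print(formula(final_model))
print(summary(final_model)$adj.r.squared)
```

Stepwise selection drops a term only when doing so lowers the AIC, so the selected model never has a higher AIC than the full model it started from.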

The adjusted \(R^2\) values for both models are:
model        adjusted R2
Full model   0.3849
AIC model    0.3870

The adjusted \(R^2\) increased after model selection, but only marginally. (Plain \(R^2\) cannot increase when predictors are removed, so the adjusted value is the appropriate one to compare.) The final model has an adjusted \(R^2\) of \(0.387\), and the fit summary reports a multiple \(R^2\) of 0.3901. Thus, our model accounts for only about 39% of the total variance in the actual race speeds.
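As a check on the reported figures, the adjusted value can be recovered from the multiple \(R^2\) in the fit summary. The summary reports 394 residual degrees of freedom with \(p = 2\) predictors, which implies \(n = 394 + 2 + 1 = 397\) observations:

\[ R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} = 1 - (1 - 0.3901)\,\frac{396}{394} \approx 0.387 \]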

The regression plots with correlations are shown to the right, with average race speed in km/h as the response variable. As seen in the plots, although there is some correlation between the variables, the variance is too high for the model to fit the data accurately.

The assumptions of the multiple linear regression model were also checked, and the corresponding plots are included to the right. Based on these plots, the assumptions of equal residual variance, independence, and normality appear to hold. We also checked for multicollinearity by looking at the correlations between the variables. The correlation plot to the right shows that multicollinearity was not an issue in this model, since all pairwise correlations are below 0.8.
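The multicollinearity check described above amounts to comparing the pairwise predictor correlations against a cutoff. The sketch below uses simulated predictors and the 0.8 threshold from the text; the variable names are illustrative.

```r
set.seed(2)

# Simulated predictors; independent by construction, so correlations are small.
predictors <- data.frame(
  p2_speed = rnorm(200, 290, 20),
  p3_speed = rnorm(200, 290, 20)
)

cor_mat <- cor(predictors)

# Look only at the off-diagonal entries (each pair once).
max_cor <- max(abs(cor_mat[upper.tri(cor_mat)]))
multicollinearity_flag <- max_cor >= 0.8

print(round(cor_mat, 3))
print(multicollinearity_flag)
```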

Conclusion: Although the model assumptions hold, this is not a good regression model, since we cannot accurately predict average race speeds from the average speeds for practice and qualifying laps.

Column

Regression plots

Model Fit summary

Below is the summary of the model that was fit to the data.

Call:
lm(formula = race_speed ~ p2_speed + p3_speed, data = mlr_df)

Residuals:
     Min       1Q   Median       3Q      Max 
-127.041  -14.345    3.637   17.877   59.784 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 59.08575   13.23099   4.466 1.04e-05 ***
p2_speed     0.63893    0.07000   9.128  < 2e-16 ***
p3_speed     0.19010    0.06123   3.105  0.00204 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26.01 on 394 degrees of freedom
Multiple R-squared:  0.3901,    Adjusted R-squared:  0.387 
F-statistic:   126 on 2 and 394 DF,  p-value: < 2.2e-16

Assumptions

Correlation plot

Ridge Regression

Column

Ridge Regression Overview

Research Question: Can we predict the overall average lap duration using the average durations of the individual track sectors?

We decided to use Ridge Regression to predict the overall average lap duration from the average durations of the three sectors of the track. We chose Ridge Regression because these predictors are highly correlated, and multicollinearity makes ordinary least squares estimates unstable. The correlations can be seen in the correlation matrix provided to the right.

The coefficients plot shows that higher choices for lambda values cause the beta values to approach zero. The minimum and optimal lambda values are also included in this plot. The Optimal Lambda plot also shows the process of picking the optimal lambda that minimizes the mean squared error.
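The shrinkage behaviour described above can be reproduced with a small closed-form ridge estimator. This is a base-R sketch on simulated, deliberately correlated sector times, not the `glmnet` fit used in the project; in particular, `glmnet` scales its penalty differently, so the \(\lambda\) values here are not comparable to the optimal value reported on this page.

```r
# Closed-form ridge estimate on standardized predictors and a centered response.
ridge_coef <- function(x, y, lambda) {
  xs <- scale(x)
  yc <- y - mean(y)
  solve(crossprod(xs) + lambda * diag(ncol(xs)), crossprod(xs, yc))
}

set.seed(3)
n <- 100
sec1 <- rnorm(n, 30, 3)
sec2 <- sec1 + rnorm(n, 5, 0.5)    # deliberately correlated with sec1
sec3 <- sec1 + rnorm(n, -2, 0.5)   # deliberately correlated with sec1
x <- cbind(sec1, sec2, sec3)
y <- sec1 + sec2 + sec3 + rnorm(n, 0, 1)

# Size of the coefficient vector for increasingly heavy penalties.
lambdas <- c(0.01, 1, 10, 100, 1000)
coef_norms <- sapply(lambdas, function(l) sqrt(sum(ridge_coef(x, y, l)^2)))

print(data.frame(lambda = lambdas, coef_norm = round(coef_norms, 3)))
```

The norm of the coefficient vector decreases as the penalty grows, which is the pattern a ridge coefficient path plot displays.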

The Model Fit summary to the right gives us the coefficients of the model using the Ridge Regression method. We can also see that the optimal lambda value is about 5.16. In order to determine the goodness of fit of this model, we used our fitted model to get the predicted y values. These were then used to calculate the sum of squared errors in order to get the \(R^2\) value, which is given below.

model R2
Ridge Regression model 0.8648704

This means that our Ridge Regression model accounts for about 86.49% of the variance in the response variable, the overall average lap duration.
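The \(R^2\) calculation from predicted values can be written as a short helper. In this sketch the helper name `r_squared` is our own, and it is demonstrated on an ordinary `lm()` fit to R's built-in `mtcars` data rather than on the project's ridge model; the same formula applies to predictions from any fitted model.

```r
# R^2 = 1 - SSE / SST, computed from observed and predicted values.
r_squared <- function(y, y_hat) {
  sse <- sum((y - y_hat)^2)      # sum of squared errors
  sst <- sum((y - mean(y))^2)    # total sum of squares
  1 - sse / sst
}

fit <- lm(mpg ~ wt + hp, data = mtcars)
r2_manual <- r_squared(mtcars$mpg, fitted(fit))

print(r2_manual)
print(summary(fit)$r.squared)   # should agree with the manual value
```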

Conclusion: This model fits well and is suitable for predicting the overall average lap duration from the average lap duration times for each sector of the track.

Column

Coefficients plot

Optimal Lambda plot

Model Fit Summary

The optimal \(\lambda\) value using glmnet() is 5.162448.

The coefficients of the Ridge Regression model using the optimal lambda value are:

term                      estimate
(Intercept)               74.7236544
avg_duration_sec1          0.7981413
avg_duration_sec2          0.1900571
avg_duration_start         0.1726709
avg_duration_sec1_start   -0.1170244
avg_duration_sec2_start   -0.3116496

The \(R^2\) value of this model is 0.8648704

Assumptions

LOESS Fit

Column

LOESS Overview

Research Question: Can we predict the average speed recorded at the speed trap using the average speed recorded at the second intermediate point on the track?

We decided to fit a LOESS regression to predict the average speed recorded at the speed trap over all laps (avg_st_speed) using the average speed recorded at the second intermediate point on the track (avg_i2_speed). The fitted LOESS plot is given to the right. As we can see, there is some variance around the fit, and the data seem to contain some outliers. To assess the goodness of fit, we calculated the mean squared error (MSE) from the residuals, which was 959.2609. To put this value into perspective, we fit a simple linear regression model to the same values and calculated its MSE to see whether LOESS performed any better. The MSE values were as follows:

method MSE
LOESS 959.2609
Simple Linear Regression 1093.7860

As we can see, the MSE value is lower for LOESS, so we can say that the LOESS model fits the data better. However, is this model even valid? To check the model validity, we made a Residuals vs Fitted plot to observe the variance of the residuals, which is given to the right. Although clustered near the right side of the plot, possibly due to outliers, the residuals seem to be randomly scattered around zero, so we can say that the equal variance assumption holds. Next, we checked for normality, which raised some issues. Both the Normal Q-Q plot and the Shapiro-Wilk normality test were used, as seen to the right under the “Normality Assumption” section. The Shapiro-Wilk test rejected normality of the residuals (W = 0.95263, p-value < 2.2e-16), so we consider the model invalid for drawing predictions. Looking at the trends in the other plots, we believe that some outliers might be interfering with the model fit and causing the normality check to fail, since the Q-Q plot shows only a few points, near the origin, that depart from the normal line.
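The comparison and the normality check described above can be sketched in a few lines of base R. The data below are simulated from a curved relationship, and the `span` and `degree` values are illustrative choices, not the settings selected for the project's fit.

```r
set.seed(4)
n <- 300

# Simulated curved relationship; NOT the project's speed data.
x <- runif(n, 0, 2 * pi)
y <- sin(x) + rnorm(n, 0, 0.3)
dat <- data.frame(x = x, y = y)

loess_fit <- loess(y ~ x, data = dat, span = 0.3, degree = 1)
lm_fit    <- lm(y ~ x, data = dat)

# Mean squared error of each fit, from its residuals.
mse_loess <- mean(residuals(loess_fit)^2)
mse_lm    <- mean(residuals(lm_fit)^2)
print(c(LOESS = mse_loess, linear = mse_lm))

# Shapiro-Wilk test on the LOESS residuals (accepts 3 to 5000 observations).
sw <- shapiro.test(residuals(loess_fit))
print(sw$p.value)
```

A small p-value from `shapiro.test()` indicates a departure from normality; with thousands of observations the test is sensitive to even modest departures, so it is usually read alongside the Q-Q plot.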

Conclusion: This is not a valid model for predicting the average speed recorded at the speed trap using the average speed recorded at the second intermediate point on the track. To address this, we could use Cook’s distance to identify and remove outliers or high-leverage points, or a Box-Cox transformation of the response to reduce the non-normality, and then refit to get a more valid model.
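For the Cook’s distance idea mentioned above, a minimal sketch is shown below. It uses a simple linear model on simulated data with one planted outlier, and the 4/n cutoff is a common rule of thumb rather than a threshold taken from this project.

```r
set.seed(5)
n <- 50

# Simulated speeds in km/h with one planted outlier in the first row.
x <- rnorm(n, 250, 20)
y <- 50 + 0.9 * x + rnorm(n, 0, 5)
y[1] <- y[1] - 150

fit <- lm(y ~ x)

# Flag observations whose Cook's distance exceeds the 4/n rule of thumb.
cd <- cooks.distance(fit)
flagged <- which(cd > 4 / n)
print(flagged)

# Refit without the flagged points.
keep <- setdiff(seq_len(n), flagged)
refit <- lm(y ~ x, subset = keep)
print(coef(refit))
```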

Column

LOESS plot

Model Fit

LOESS Fit MSE: 959.2609027
Simple Linear Model Fit MSE: 1093.7857316

Call:
loess(formula = y ~ x, data = data.bind, span = span1, degree = degree, 
    family = family)

Number of Observations: 2563 
Equivalent Number of Parameters: 18.45 
Residual Standard Error: 31.13 
Trace of smoother matrix: 21.9  (exact)

Control settings:
  span     :  0.09256002 
  degree   :  1 
  family   :  gaussian
  surface  :  interpolate     cell = 0.2
  normalize:  TRUE
 parametric:  FALSE
drop.square:  FALSE 

Equal Variance Assumption

Normality Assumption



    Shapiro-Wilk normality test

data:  loessfit$residuals
W = 0.95263, p-value < 2.2e-16

kNN Classification

Column

Explanation

Research Question: Are we able to accurately predict whether a racer will place in the top 10 of a given race/practice run?

\(H_0\): The null model is our best model.

\(H_1\): A k-NN model is a more accurate model to predict the outcome of the racer (top-10).

In the k-Nearest Neighbors model we designed, we included all non-redundant information (i.e., we eliminated numeric codes that duplicate other columns) to give the model the best chance of finding patterns. After encoding both the meeting names and locations as dummy variables, we ran the model and achieved a level of accuracy above what we had hoped for. Our best result was:

Accuracy K-Value
0.7069351 1

The first plot on the right shows the accuracies for k-values from 1 to 15. We can see a clear downward trend, so we did not feel the need to continue with higher k-values, especially since we were trying to predict “winners” out of groups of 20.
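The accuracy-versus-k comparison can be sketched with a small hand-written k-NN classifier. This is a base-R illustration on simulated, already standardized features; the function `knn_predict` and the two feature columns are our own stand-ins, not the code or feature set behind the chart.

```r
# Classify each test row by majority vote among its k nearest training rows.
# Ties in the vote go to the class that comes first alphabetically.
knn_predict <- function(train_x, train_y, test_x, k) {
  apply(test_x, 1, function(row) {
    d <- sqrt(colSums((t(train_x) - row)^2))   # Euclidean distances
    nearest <- order(d)[seq_len(k)]
    votes <- table(train_y[nearest])
    names(votes)[which.max(votes)]
  })
}

set.seed(6)
n <- 200
top10 <- rep(c("yes", "no"), each = n / 2)

# Simulated standardized features; NOT the project's data.
x <- cbind(
  avg_duration = rnorm(n, ifelse(top10 == "yes", -1, 1)),
  avg_st_speed = rnorm(n, ifelse(top10 == "yes", 1, -1))
)

# Hold out 50 rows for testing and score k = 1, ..., 15.
train <- sample(n, 150)
acc <- sapply(1:15, function(k) {
  pred <- knn_predict(x[train, ], top10[train], x[-train, ], k)
  mean(pred == top10[-train])
})

print(round(acc, 3))
print(which.max(acc))   # k with the highest held-out accuracy
```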

One reason this model may have been successful is the nature of the data. On the right are three scatterplots of the standardized times plotted against each other, which show a large amount of grouping. This is ideal for the k-NN algorithm, especially since it can also use the location variables to break the data down into even smaller clusters of neighbors. For a better idea of how well our model did, a visualization of the confusion matrix is given as the last plot on the right.

Since there are 20 positions, a null model that predicts the same class for every racer would be correct about 50% of the time, so our model only had to beat an accuracy of 50%. It achieved about 70.7% at k = 1. Thus, this comparison supports \(H_1\): the k-NN model is the more accurate model for predicting whether or not a racer will place in the top 10 of a given race or practice.
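The null-model baseline used in this comparison is the accuracy obtained by always predicting the most common class. A minimal sketch, assuming a full field of 20 finishers and a helper name (`null_accuracy`) of our own choosing:

```r
# Accuracy of a model that always predicts the most frequent class.
null_accuracy <- function(classes) {
  max(table(classes)) / length(classes)
}

# With 20 finishers, 10 are in the top 10 and 10 are not.
top10 <- rep(c("yes", "no"), times = c(10, 10))
baseline <- null_accuracy(top10)
print(baseline)   # 0.5
```

The same helper gives the baseline for any class split, for example an unbalanced one where a single class makes up most of the entries.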

Summary Statistics for Non-Top 10

statistic  avg_duration  avg_duration_sec1  avg_duration_sec2  avg_duration_sec3  avg_i1_speed  avg_st_speed
Min.       -3.46472      -2.20142           -3.254584          -2.261067          -2.93446      -8.06515
1st Qu.    -0.78539      -0.95230           -0.576076          -0.532085          -0.77858      -0.50528
Median     -0.05654       0.01575            0.008087          -0.162523           0.22632       0.15729
Mean       -0.10000      -0.09031            0.008531           0.003542           0.02105       0.02362
3rd Qu.     0.43281       0.56529            0.722877           0.314717           0.83606       0.75674
Max.        4.87405       3.58951            2.697585           4.623260           2.09781       1.71540

Summary Statistics for Top 10

statistic  avg_duration  avg_duration_sec1  avg_duration_sec2  avg_duration_sec3  avg_i1_speed  avg_st_speed
Min.       -2.76598      -1.82850           -3.247409          -2.080458          -3.43133      -3.88864
1st Qu.    -0.57024      -0.85903           -0.643692          -0.566149          -0.72852      -0.54257
Median      0.24093       0.25554            0.005930          -0.192201           0.21292       0.16065
Mean        0.09619       0.08687           -0.008206          -0.003407          -0.02025      -0.02272
3rd Qu.     0.68983       0.80170            0.751916           0.293268           0.76496       0.67046
Max.        3.69800       2.78013            2.599732           4.678484           1.89016       1.59535

Column

K Value Chart

Variable Slice 1

Variable Slice 2

Variable Slice 3

Confusion Matrix

Naive Bayes

Column

Explanation

Research Question: Is a Naive Bayes model better at predicting which round of practice a given entry is than the null model?

\(H_0\): The null model is the best model for predicting which round of practice an entry is.

\(H_1\): The Naive Bayes model is better at classifying which round of practice an entry is than the null model.

For the Naive Bayes model, we were interested to see if we could predict which round of practice an entry in the data came from. For (almost) every race, there are multiple rounds of practice where the drivers can warm up both the car and themselves to get ready to drive. We expected a higher average speed as the drivers got more practice in, so we expected Practice 1 (round 1) to have a lower transformed minimum and maximum than Practice 3 (we applied a log transform because of the high spread). However, the summary statistics tables for each round show that while the minimums for each numeric variable tend to increase, the maximums either decrease or stay roughly the same. After seeing this, we decided to also include the meeting names to help the model learn the distribution of practice rounds for each location. You can see this in the second chart on the right.

Overall the model didn’t do very well which was to be expected. The numerical predictors had very little difference between each round of practice despite having such a large spread. The inclusion of the locations helped, especially with locations that only had 1 round of practice such as the US Grand Prix and the Qatar Grand Prix, however the model still had limited success. Looking at the confusion matrix which is the last chart on the right, we see that the model predicted Practice 2 most of the time, and was unable to differentiate Practice 3 from the other two Practice rounds. It had a decent detection rate of Practice 1, but overall the model accuracy was rather low:

Accuracy
0.4438903

While this is a fairly low accuracy, the most common practice round is Practice 1 at around 40% of entries, which is the best the null model could achieve by always picking one class. Since the Naive Bayes accuracy of about 44.4% exceeds this, we reject the null hypothesis and say that the Naive Bayes model is a better model for predicting which practice round a given entry is, although only by a small margin.

Summary Statistics for Practice 1

statistic  avg_duration  avg_duration_sec1  avg_duration_sec2  avg_duration_sec3  avg_i1_speed  avg_st_speed
Min.       3.273         2.678              2.862              2.699              4.747         4.220
1st Qu.    4.913         3.983              3.521              3.310              5.252         5.464
Median     5.041         4.366              3.646              3.454              5.435         5.535
Mean       5.025         4.281              3.632              3.558              5.381         5.504
3rd Qu.    5.143         4.590              3.850              3.643              5.506         5.600
Max.       6.704         6.641              4.234              5.164              5.673         5.753

Summary Statistics for Practice 2

statistic  avg_duration  avg_duration_sec1  avg_duration_sec2  avg_duration_sec3  avg_i1_speed  avg_st_speed
Min.       4.156         2.643              2.960              3.032              4.974         5.090
1st Qu.    4.854         4.058              3.504              3.286              5.249         5.483
Median     4.945         4.267              3.618              3.408              5.400         5.544
Mean       4.987         4.187              3.598              3.571              5.365         5.526
3rd Qu.    5.070         4.424              3.808              3.610              5.489         5.605
Max.       6.337         5.676              3.978              5.538              5.719         5.718

Summary Statistics for Practice 3

statistic  avg_duration  avg_duration_sec1  avg_duration_sec2  avg_duration_sec3  avg_i1_speed  avg_st_speed
Min.       4.036         2.734              2.932              3.075              4.851         4.926
1st Qu.    4.961         4.185              3.510              3.338              5.255         5.448
Median     5.098         4.533              3.651              3.483              5.417         5.534
Mean       5.104         4.390              3.629              3.598              5.376         5.507
3rd Qu.    5.240         4.791              3.835              3.689              5.501         5.598
Max.       6.013         5.894              4.050              4.979              5.705         5.744

Classifications Example

Practice 1  Practice 2  Practice 3  Predicted Class  True Class
0.1432      0.3111      0.5457      Practice 3       Practice 2
0.4642      0.0133      0.5225      Practice 3       Practice 2
0.1632      0.2309      0.6058      Practice 3       Practice 2
0.1758      0.1984      0.6257      Practice 3       Practice 3
0.1126      0.5630      0.3244      Practice 2       Practice 3
0.1211      0.4071      0.4718      Practice 3       Practice 3
0.1293      0.3463      0.5244      Practice 3       Practice 3
0.1323      0.3746      0.4931      Practice 3       Practice 1
0.2524      0.0897      0.6580      Practice 3       Practice 1
0.2214      0.1247      0.6539      Practice 3       Practice 1
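Each row of a table like the one above holds the posterior probability of every class, and the predicted class is the one with the largest probability. The sketch below shows how such posteriors arise in a Gaussian Naive Bayes classifier. It is a hand-written base-R illustration on simulated data with two numeric features only; the function `gaussian_nb_posterior` is our own, and it does not include the categorical meeting-name predictor used in the project.

```r
# Posterior class probabilities under Gaussian Naive Bayes:
# log prior + sum of per-feature normal log densities, then normalize.
gaussian_nb_posterior <- function(train_x, train_y, new_x) {
  classes <- sort(unique(train_y))
  m <- nrow(new_x)
  log_post <- matrix(unlist(lapply(classes, function(cl) {
    rows <- train_y == cl
    log_lik <- matrix(unlist(lapply(seq_len(ncol(train_x)), function(j) {
      dnorm(new_x[, j], mean(train_x[rows, j]), sd(train_x[rows, j]), log = TRUE)
    })), nrow = m)
    log(mean(rows)) + rowSums(log_lik)
  })), nrow = m)
  colnames(log_post) <- classes
  post <- exp(log_post - apply(log_post, 1, max))   # subtract row max for stability
  post / rowSums(post)
}

set.seed(7)
n <- 150
session <- rep(c("Practice 1", "Practice 2", "Practice 3"), each = n / 3)
shift <- c("Practice 1" = 0, "Practice 2" = 0.5, "Practice 3" = 1)[session]

# Simulated log-scale features with small differences between rounds.
x <- cbind(
  avg_duration = rnorm(n, 5 - 0.1 * shift, 0.2),
  avg_st_speed = rnorm(n, 5.5 + 0.05 * shift, 0.1)
)

post <- gaussian_nb_posterior(x, session, x[1:5, , drop = FALSE])
pred <- colnames(post)[max.col(post, ties.method = "first")]
print(data.frame(round(post, 4), predicted = pred, check.names = FALSE))
```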

Column

Class Breakdown

Class Locations

Confusion Matrix

Logistic Regression

Column

Explanation

Research Question: Is a Logistic Regression model better than the null model at predicting whether an entry is a race or a practice?

\(H_0\): No, the null model is the better model for predicting whether an entry is a race or practice.

\(H_1\): Yes, the Logistic Regression model is the better model for predicting whether an entry is a race or practice.

As specified above, here we are interested in classifying the entries into Race and Practice. Overall there are 3 different race types (Race, Sprint, and Sprint Shootout) and 4 practice types (Practice 1, Practice 2, Practice 3, and Qualifying). We originally used the full model with all the available variables, but since Logistic Regression is a generalized linear model fit with the glm function, we were able to use stepwise regression to narrow down the variables. The remaining variables included all of the numerical variables, as well as the year and meeting place name.

On the right, we show a breakdown of Race vs Practice percentages in the data, as well as graphs of the numerical data colored by class. Our last chart on the right is the confusion matrix, which shows that the model had a very high detection and success rate for the “Race” class, but fewer than half of the “Practice” entries were predicted correctly. Overall the accuracy of the model was:

Accuracy
0.9140625

This accuracy looks much better than the accuracy of the Naive Bayes model; however, the null model would be able to achieve roughly 70% accuracy simply by choosing “Race” every time. So this model improves on the null model by roughly 21 percentage points (about 91.4% versus about 70%). Looking at the plotted points in the scatterplots, it would also be worth attempting this classification with k-NN, since there are similar groupings. Overall, we have shown that the Logistic Regression model is more accurate, and thus we conclude that the Logistic Regression model is better than the null model.
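The workflow described in this section (fit a full logistic model, narrow it with stepwise selection, then compare its accuracy with the majority-class baseline) can be sketched as follows. The data are simulated, the 0.5 probability cutoff is an assumption, and the column names, including the pure-noise predictor `noise`, are illustrative only.

```r
set.seed(8)
n <- 500
is_race <- rbinom(n, 1, 0.7)   # about 70% of entries in the majority class

# Simulated standardized features; NOT the project's data.
sessions_df <- data.frame(
  is_race      = is_race,
  avg_duration = rnorm(n, ifelse(is_race == 1, -1, 1)),
  avg_st_speed = rnorm(n, ifelse(is_race == 1, 0.4, -0.2)),
  noise        = rnorm(n)      # unrelated to the class by construction
)

# Full logistic model, then AIC-based stepwise selection.
full    <- glm(is_race ~ ., data = sessions_df, family = binomial)
reduced <- step(full, trace = 0)

# Classify with a 0.5 cutoff and tabulate against the true classes.
pred <- as.integer(predict(reduced, type = "response") > 0.5)
confusion <- table(predicted = pred, actual = sessions_df$is_race)
accuracy <- mean(pred == sessions_df$is_race)

# Majority-class (null model) accuracy for comparison.
null_accuracy <- max(table(sessions_df$is_race)) / n

print(confusion)
print(c(model = accuracy, null = null_accuracy))
```

Note that this sketch scores the model on the same rows it was fit to; a held-out test set would give a less optimistic accuracy estimate.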

Summary Statistics for Practice

statistic  avg_duration  avg_duration_sec1  avg_duration_sec2  avg_duration_sec3  avg_i1_speed  avg_st_speed
Min.       -4.72438      -2.1773            -5.204720          -2.0222            -3.85844      -9.37728
1st Qu.    -0.05902      -0.1115            -0.438566          -0.5047            -0.68549      -0.56746
Median      0.33213       0.4375             0.067485          -0.1098             0.28031       0.03809
Mean        0.34474       0.3200             0.003027           0.1556             0.04054      -0.17133
3rd Qu.     0.72146       0.8590             0.776976           0.3771             0.77206       0.49404
Max.        5.09201       3.8076             2.381576           5.1511             1.93802       1.57330

Summary Statistics for Race

statistic  avg_duration  avg_duration_sec1  avg_duration_sec2  avg_duration_sec3  avg_i1_speed  avg_st_speed
Min.       -3.5507       -2.0480            -2.799308          -2.064175          -3.31636      -3.45631
1st Qu.    -1.3935       -1.1736            -0.599634          -0.817456          -0.95683      -0.05357
Median     -0.9971       -0.9676             0.129379          -0.454300          -0.08776       0.61466
Mean       -0.8754       -0.8125            -0.007686          -0.395036          -0.10295       0.43507
3rd Qu.    -0.6438       -0.6712             0.604307           0.008255           0.83364       1.15002
Max.        2.7823        2.4860             2.324628           2.914064           2.15583       1.75966

Column

Class Breakdown

Variable Slice 1

Variable Slice 2

Confusion Matrix